Restodecoca
Contributor

@Restodecoca Restodecoca commented Oct 4, 2025

Description

This PR introduces the ParadeDB vector store integration, extending the PostgreSQL-based store to support BM25 + vector search using ParadeDB.

This implementation is based on llama-index-vector-stores-postgres, and has been refactored to inherit directly from PGVectorStore, reducing duplicated logic and ensuring full compatibility with the PostgreSQL backend.

It also supports custom query execution, enabling advanced hybrid retrieval use cases through ParadeDB’s enhanced search engine.

This PR:

  • Adds full ParadeDB compatibility (schema, extensions, and BM25 index creation).
  • Supports hybrid dense + sparse retrieval (BM25 + embeddings) via ParadeDB’s pg_search extension.
  • Refactors the ParadeDB store to extend PGVectorStore while overriding BM25-specific behavior.
  • Reintroduces custom query support, improving flexibility for advanced search operations.
  • Includes a new README.md with installation instructions.
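
To make the hybrid dense + sparse bullet concrete, here is a minimal sketch of one common way to merge a BM25 ranking with an embedding ranking: reciprocal rank fusion. This is an assumption for illustration only. The description does not say which fusion strategy the store or the calling retriever uses, and the function name, document IDs, and `k=60` constant are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    documents ranked highly by both retrievers end up on top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings: one from a BM25 query, one from a vector query.
bm25_ranking = ["doc_b", "doc_a", "doc_c"]
vector_ranking = ["doc_a", "doc_d", "doc_b"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Rank-based fusion avoids comparing raw BM25 scores with vector distances, which live on different scales.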

Fixes

Fixes #
(or leave blank if this is a new feature without a linked issue)


New Package?

  • Yes: llama-index-vector-stores-paradedb
  • No

A detailed README.md was added with usage examples, setup instructions, and integration notes.
The llama-index-vector-stores-postgres dependency is also declared in pyproject.toml under tool.poetry.dependencies.


Version Bump

  • Yes — bumped version to 0.1.0
  • No

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

  • Added new unit tests to cover BM25 ranking and hybrid retrieval
  • Verified compatibility with PostgreSQL and ParadeDB backends
  • Compared tsvector and BM25 ranking results
  • Confirmed all existing tests (pytest) pass successfully
  • Additional manual validation in Dockerized environments

Example Output

--- TSVECTOR RESULTS ---
Rank 1: ID=ccc, Score=0.06079
Rank 2: ID=ddd, Score=0.06079

--- BM25 RESULTS ---
Rank 1: ID=ddd, Score=0.67853
Rank 2: ID=ccc, Score=0.50741
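
One possible reading of the output above (an assumption, not something verified against the test data) is that the tsvector ranking ties because PostgreSQL's `ts_rank` ignores document length by default, while BM25 weights term frequency against document length and term rarity. The sketch below implements the textbook Okapi BM25 formula in plain Python on a hypothetical toy corpus; it is not ParadeDB's implementation and does not reproduce the scores shown.

```python
import math


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    scores = {}
    for doc_id, tokens in docs.items():
        score = 0.0
        for term in query_terms:
            tf = tokens.count(term)
            if tf == 0:
                continue
            # Document frequency: how many documents contain the term.
            df = sum(1 for other in docs.values() if term in other)
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: longer documents are penalized.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


# Hypothetical corpus: both matching documents contain "bm25" exactly once,
# but they differ in length.
docs = {
    "short": "paradedb adds bm25 search".split(),
    "long": (
        "paradedb is a postgres extension that adds bm25 search "
        "among many other features"
    ).split(),
    "other": "unrelated text".split(),
}

scores = bm25_scores(["bm25"], docs)
for doc_id, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{doc_id}: {score:.5f}")
```

With equal term frequency, the shorter document scores higher under BM25, which is the kind of separation a length-agnostic ranking would not show.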


Suggested Checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation (README.md)
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • I have added Google Colab support for the newly added notebooks
  • I have added tests that prove my feature works
  • New and existing unit tests pass locally with my changes
  • I ran uv run ruff check --fix . and uv run ruff format . to appease the lint gods

Summary

  • ParadeDB now fully inherits from PGVectorStore, providing a cleaner, more maintainable implementation.
  • BM25 and hybrid retrieval are both supported natively.
  • Custom query execution is re-enabled, restoring flexibility for advanced retrieval logic.
  • Roughly 2K lines removed while retaining full test coverage.
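
The subclassing approach summarized above can be pictured with the following sketch. The class and method names are hypothetical stand-ins, not the actual `PGVectorStore` internals, and the column names (`embedding`, `text`, `text_search_tsv`) are assumed. The SQL fragments are illustrative: `<=>`, `ts_rank`, and `plainto_tsquery` come from pgvector and PostgreSQL, while `@@@` and `paradedb.score` follow pg_search's documented syntax in some versions and may differ in others.

```python
class BasePostgresStore:
    """Hypothetical stand-in for the parent PostgreSQL vector store."""

    def __init__(self, table_name: str) -> None:
        self.table_name = table_name

    def build_dense_query(self, limit: int) -> str:
        # pgvector cosine-distance operator; inherited unchanged by subclasses.
        return (
            f"SELECT id, embedding <=> :query_embedding AS distance "
            f"FROM {self.table_name} ORDER BY distance LIMIT {limit}"
        )

    def build_sparse_query(self, limit: int) -> str:
        # PostgreSQL full-text ranking over a tsvector column.
        return (
            f"SELECT id, ts_rank(text_search_tsv, plainto_tsquery(:query_str)) AS score "
            f"FROM {self.table_name} "
            f"WHERE text_search_tsv @@ plainto_tsquery(:query_str) "
            f"ORDER BY score DESC LIMIT {limit}"
        )


class Bm25Store(BasePostgresStore):
    """Hypothetical subclass: only the sparse (keyword) query is overridden."""

    def build_sparse_query(self, limit: int) -> str:
        # BM25 match and score, assuming a BM25 index exists on the table.
        return (
            f"SELECT id, paradedb.score(id) AS score "
            f"FROM {self.table_name} "
            f"WHERE text @@@ :query_str "
            f"ORDER BY score DESC LIMIT {limit}"
        )


base = BasePostgresStore("documents")
bm25 = Bm25Store("documents")
print(bm25.build_sparse_query(limit=5))
```

Because only one method differs, everything else (connection handling, dense search, inserts) is inherited rather than copied, which is the maintainability gain the summary describes.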

@dosubot (bot) added the size:XXL label ("This PR changes 1000+ lines, ignoring generated files.") on Oct 4, 2025
Collaborator

@logan-markewich logan-markewich left a comment

Did this really need to copy paste 1K lines from the postgres integration vs. just subclassing it?

@Restodecoca
Contributor Author

Did this really need to copy paste 1K lines from the postgres integration vs. just subclassing it?

Yeah, I think you're right. I'm gonna redo it by subclassing pgvector to simplify. Thanks for the feedback.
